Compression techniques for distributed data

ABSTRACT

In one example, uncompressed data is compressed and divided into chunks. Each chunk of the compressed data stream is combined with state information to enable each chunk to be independently decompressed. Each of the compressed chunks is then stored on a different storage device along with its associated state information. A compute operation can then be offloaded to the device or node where each chunk is stored. Each chunk can be independently decompressed for execution of the offloaded operation without transferring all chunks to a central location for decompression and performance of the operation.

FIELD

The descriptions are generally related to computers and more specifically to data compression and compute offloads.

BACKGROUND

Data compression reduces the size of the data that is stored and transferred.

BRIEF DESCRIPTION OF THE DRAWINGS

The following description includes discussion of figures having illustrations given by way of example of implementations of embodiments of the invention. The drawings should be understood by way of example, and not by way of limitation. As used herein, references to one or more “embodiments” or “examples” are to be understood as describing a particular feature, structure, and/or characteristic included in at least one implementation of the invention. Thus, phrases such as “in one embodiment” or “in one example” appearing herein describe various embodiments and implementations of the invention, and do not necessarily all refer to the same embodiment. However, they are also not necessarily mutually exclusive.

FIG. 1A illustrates an example of a compression technique.

FIG. 1B illustrates an example of erasure coding the compressed data of FIG. 1A.

FIG. 2A illustrates an example of pseudocode for producing parity data.

FIG. 2B illustrates an example of pseudocode for offloading an operation on compressed and distributed data by a master controller.

FIG. 2C illustrates an example of pseudocode for performing the offload at a storage node.

FIG. 3 illustrates an example of offloading operations on the compressed data.

FIG. 4 is a flow chart illustrating an example of a method of compressing and storing data.

FIG. 5 is a flow chart of an example of a method of handling an offload request on compressed data.

FIG. 6 illustrates an example of a disaggregated rack architecture in which compression and offload techniques may be implemented.

FIG. 7 provides an exemplary depiction of a computing system in which compression and offloading techniques can be implemented.

Descriptions of certain details and implementations follow, including a description of the figures, which may depict some or all of the embodiments described below, as well as discussing other potential embodiments or implementations of the inventive concepts presented herein.

DETAILED DESCRIPTION

The present disclosure describes techniques for compression that enable individually decompressing a chunk of compressed data at the device or node where it resides, enabling local offload operations.

Bringing select compute operations close to storage devices can provide significant performance, power, and scalability advantages. However, when the data is compressed as well as distributed, the offload operations cannot run because the compressed data on any given node typically cannot be decompressed without knowledge of the data on other nodes. For example, if a file is compressed and then split across five nodes, a search operation cannot be delegated to the five nodes, since the split portions cannot be independently decompressed.

In contrast, the techniques described in this disclosure enable individually decompressing each EC (erasure coding) code while retaining the benefits of encoding a large file. In one example, compressed data is divided into chunks such that no tokens span multiple chunks. Each chunk is combined with state information to enable independent decompression. The combined chunks and associated state information can be erasure coded to generate parity information. Subsequent operations on the compressed data can be offloaded to where each chunk is stored to enable parallel decompression and offloading.

FIG. 1A illustrates an example of a compression technique. An uncompressed data stream 108 is fed to compression logic 106. The compression logic 106 can be software, hardware, or a combination. The compression logic 106 compresses the incoming data 108 and outputs compressed data 110. The compression logic 106 can include a codec (coder-decoder) that encodes and compresses the incoming uncompressed data and decompresses coded compressed data. The compression logic 106 compresses the data using a compression algorithm.

A variety of compression algorithms can be used, some of which are more suitable for certain types of data. Some compression algorithms are “lossless.” Lossless compression algorithms compress the data to ensure recovery of the uncompressed data without any loss of information. Examples of lossless compression algorithms include Lempel-Ziv (LZ), Lempel-Ziv-Welch (LZW), prediction by partial matching (PPM), Huffman coding, run-length encoding (RLE), Portable Network Graphics (PNG), Tagged Image File Format (TIFF), and grammar or dictionary-based algorithms. Other compression algorithms are “lossy.” Lossy compression algorithms compress data by discarding information that is determined to be nonessential. Thus, lossy compression is typically irreversible (i.e., data compressed with a lossy compression algorithm typically cannot be decompressed to its original form). Lossy compression algorithms are typically used for multimedia files (e.g., images, streaming video, audio files, or other media data). Lossless compression is typically used for files that need to be reconstructed without any loss of information. For example, it is typically undesirable to drop information from text files (e.g., emails, records, or other documents containing text) or program files (e.g., executable files or other program data files); therefore, such files are typically compressed with a lossless compression algorithm. Compression techniques involve storing information (e.g., codec state information) to enable decompression of the compressed data. The codec state information can include, for example, information identifying the type of compression algorithm, a model, a dictionary, and/or other information to enable decompression of the data.
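
To make the lossless case concrete, the following Python sketch shows run-length encoding, one of the lossless algorithms named above. It is illustrative only; the disclosed techniques are not limited to RLE or to any particular codec, and the function names are arbitrary.

    # Minimal run-length encoding (RLE) sketch; illustrative only, not the
    # codec required by the disclosed techniques.
    def rle_compress(data: bytes):
        """Return a list of (byte_value, run_length) pairs."""
        runs = []
        i = 0
        while i < len(data):
            j = i
            while j < len(data) and data[j] == data[i]:
                j += 1
            runs.append((data[i], j - i))
            i = j
        return runs

    def rle_decompress(runs):
        """Reconstruct the original bytes from (value, length) pairs."""
        return b"".join(bytes([value]) * length for value, length in runs)

    original = b"aaaabbbccd"
    assert rle_decompress(rle_compress(original)) == original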

Referring again to FIG. 1A, the compression logic divides the compressed data into multiple chunks or portions. The compression logic produces state information associated with each chunk of the compressed data. In one example, a data reformat block 101 receives the compressed data 110 along with the associated state information (State0-StateX) and combines the state information with the associated chunk. The data reformat block 101 can include software, hardware, firmware, or a combination. In the example illustrated in FIG. 1A, the data reformat block 101 prepends the state information to the associated chunk. The Compressed Data (CDATA) chunks are then written to storage devices (e.g., SSD0-SSDX) along with the state information. Each chunk of a given compressed data stream is stored on a different storage device along with its associated state information.
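
For illustration, the data reformat step can be sketched as follows in Python, under the assumption that the chunks and their state blobs are already available as byte strings. The device paths and helper name are hypothetical placeholders, not part of FIG. 1A.

    # Sketch of the data reformat block: prepend each chunk's codec state and
    # write each (state + chunk) record to a different storage device.
    # device_paths are hypothetical placeholders for SSD0-SSDX.
    def reformat_and_store(chunks, states, device_paths):
        """chunks[i] and states[i] are bytes; device_paths[i] is a writable path."""
        assert len(chunks) == len(states) <= len(device_paths)
        for i, (state, chunk) in enumerate(zip(states, chunks)):
            record = state + chunk                  # prepend state to its chunk
            with open(device_paths[i], "wb") as dev:
                dev.write(record)                   # one record per device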

Referring now to FIG. 1B, the output of the data reformat block is fed into erasure coding logic 156, which produces erasure coding (EC) parity data (EC Parity Data 0-EC Parity Data Y). The EC parity data is then written to storage devices. In the illustrated example, the EC parity data is written to storage devices allocated for EC (e.g., SSDX-SSDX+Y), which are different than the storage devices storing the chunks and associated state information. Thus, this technique differs from conventional compression and erasure coding techniques in several ways. As mentioned above, traditional compression algorithms include dictionary state information only at the beginning of the compressed data stream. If the compressed data is then divided and stored across multiple storage devices, only the first chunk of the compressed data would have the state information and the chunks may include split tokens. Here, no token spans multiple chunks, and each chunk is combined with state information to enable independent decompression. Furthermore, in conventional erasure coding techniques, the erasure coding would be performed on each chunk of the compressed data. Thus, a given chunk may be further split and/or encoded to include parity information. Here, in one example, the parity information is computed for all the chunks and associated state information together. Each chunk together with its associated state information is an EC-code. The computed parity information makes up the other EC-codes. In this way, the compression/decompression algorithm is erasure-coding aware in the sense that each chunk with its associated state information is an EC-code.

FIG. 2A illustrates an example of pseudocode for producing parity data. In the example illustrated in FIG. 2A, for i=0 to X (where the compressed data is split into X+1 chunks), chunk_i is written to disk_i, at line 202. In this example, “chunk_i” includes the compressed data chunk in addition to its associated state information. For j=1 to Y (where Y is a redundancy level that allows Y disks to fail without data loss), parity information is written to Disk_(j+X), where the parity information is given by an erasure coding function: EC-Encode(galois-function_j, Chunk_0 . . . Chunk_X), at line 204. The last chunk, Chunk_X+1, is then written to Disk_0 as part of the next EC-stripe, at line 206.
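
A runnable Python rendering of the FIG. 2A stripe-write flow is sketched below under simplifying assumptions: the (state + chunk) records are equal length, and the Galois-field parity functions are passed in as callables. The byte-wise XOR encoder shown is only a stand-in for the single-parity case; a deployment with Y parity disks would use a full Reed-Solomon implementation.

    # Sketch of the FIG. 2A stripe-write loop; xor_parity is a stand-in for a
    # single Galois-field parity function.
    def xor_parity(records):
        """Byte-wise XOR across equal-length (state + chunk) records."""
        parity = bytearray(len(records[0]))
        for record in records:
            for k, byte in enumerate(record):
                parity[k] ^= byte
        return bytes(parity)

    def write_ec_stripe(records, disk_paths, parity_encoders):
        """records: X+1 equal-length (state + chunk) byte strings;
        disk_paths: X+1+Y writable paths; parity_encoders: Y callables."""
        X = len(records) - 1
        for i, record in enumerate(records):             # line 202: chunk_i -> disk_i
            with open(disk_paths[i], "wb") as disk:
                disk.write(record)
        for j, encode in enumerate(parity_encoders, 1):  # line 204: parity -> Disk_(X+j)
            with open(disk_paths[X + j], "wb") as disk:
                disk.write(encode(records))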

Thus, each chunk of compressed data is stored with sufficient state information to enable independent decoding of the chunk, and EC information for the chunks combined with the state information is stored separately to enable error correction.

FIGS. 2B and 2C are examples of pseudocode of a scheme that enables execution of an operation on a compressed data stream that is split across multiple nodes, with the dictionary state for each node stored with the compressed chunk.

FIG. 2B illustrates an example of pseudocode for offloading an operation on compressed and distributed data by a master controller. The pseudocode of FIG. 2B illustrates a function called MasterController::ProcessOperation that can be performed, for example, by a compute node such as the compute node 604 of FIG. 6, discussed below. The MasterController::ProcessOperation function receives an operation B to perform (e.g., an offload-binary B) and the addresses A of the compressed data.

The pseudocode of the MasterController::ProcessOperation function starts at line 220 with identifying the nodes N_1-N_k that contain the data at addresses A. Determining which nodes store the data at addresses A varies depending on the implementation. In one example, the location of the data is determined either from a map or algorithmically. In one example, the physical location of chunks of data is defined by a node number (e.g., sled number), disk number, and a sector range on that disk (e.g., the logical block address (LBA) where the code is stored). In the illustrated example, the default codec state, S_0, is then initialized, at line 222. After the nodes storing the compressed data are identified, the request is sent to all the nodes (nodes 1-k), at lines 224-226. Because the chunks can be independently decompressed, the request can be sent concurrently to all nodes. The request includes the operation (B) to be performed and the address range A of the compressed data.
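
A Python sketch of this flow appears below. The locate_nodes callable and the per-node send_offload method are hypothetical stand-ins for the map-based or algorithmic address lookup and for the actual transport, both of which are implementation specific.

    # Sketch of MasterController::ProcessOperation (FIG. 2B).
    from concurrent.futures import ThreadPoolExecutor

    def master_process_operation(offload_binary, addresses, locate_nodes):
        nodes = locate_nodes(addresses)            # line 220: nodes N_1..N_k holding A
        default_codec_state = None                 # line 222: initialize default state S_0
        with ThreadPoolExecutor() as pool:         # lines 224-226: send to all nodes at once
            futures = [pool.submit(node.send_offload, offload_binary, addresses)
                       for node in nodes]
            return [future.result() for future in futures]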

FIG. 2C illustrates an example of pseudocode for performing the offload at a storage node. The pseudocode of FIG. 2C illustrates a function called StorageNode::ProcessOperation that can be performed, for example, by a storage node that stores a compressed chunk, such as the storage node 606 of FIG. 6. The StorageNode::ProcessOperation function receives the offload-binary B and the addresses A of the compressed data.

The pseudocode of the StorageNode::ProcessOperation function starts at lines 240-242 with removing the head of the addresses A and storing the head in Ai. Thus, Ai holds the address of the next compressed data on which to perform the operation B. The remaining addresses (A-Ai) are then stored in RemainingA, at line 244. The storage node reads the compressed data at the addresses Ai and stores the compressed data in Ci, at line 246. At line 248, the storage node then extracts the codec state from Ai and stores it in Si. At line 250, the codec state is programmed based on the extracted codec state. In this example, the codec state for each chunk is extracted from that chunk, making it unnecessary to receive codec state from other nodes. The storage node then decompresses the data Ci, at line 252. The output of the decompression is the decompressed data Di. The storage node then executes B on the decompressed data Di and sends the results to the master controller, at line 254. The compressed data addresses A are then updated with the remaining addresses at line 256. The operations are then repeated until all the compressed data addresses have been processed (e.g., RemainingA is empty), at line 258.
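
The per-node loop can be sketched in Python as below. The read_at, extract_state, and codec helpers are hypothetical interfaces; the essential point is that the codec state programmed at each iteration comes from the record just read, not from any other node.

    # Sketch of StorageNode::ProcessOperation (FIG. 2C).
    def storage_node_process_operation(offload, addresses, read_at, extract_state, codec):
        results = []
        remaining = list(addresses)
        while remaining:                                # line 258: repeat until A is exhausted
            address = remaining.pop(0)                  # lines 240-244: head Ai and RemainingA
            record = read_at(address)                   # line 246: read the stored record Ci
            state, compressed = extract_state(record)   # line 248: recover codec state Si
            codec.load_state(state)                     # line 250: program the codec
            decompressed = codec.decompress(compressed) # line 252: produce Di
            results.append(offload(decompressed))       # line 254: execute B, report result
        return results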

Consider an exemplary scenario in which the uncompressed data stream is 2 MB. Referring to FIG. 1A, the 2 MB of uncompressed data 108 is provided to the compression logic 106. In this example, the data is compressed with a 2× compression ratio using a dictionary-based compression technique. Thus, in this example, the uncompressed data is compressed to 1 MB of compressed data plus dictionary state information (DictState_n), which is 5 kB for each 95 kB chunk of the 1 MB of compressed data. The compressed data 110 is logically divided into eleven chunks, the first ten of which are 95 kB each (95 kB=100 kB code-word size−5 kB dictionary state size), and the eleventh chunk, which is 74 kB in this example (1024 kB of compressed data minus 950 kB in the first ten chunks). This logical division is approximate, and in practice ensures that none of the compressed tokens span multiple chunks. Each chunk is then fed to the data reformat logic 101, which prepends each chunk CDATA_n with STATE_n.

The prepending operation results in ten chunks (including the state information) which are 100 kB each and an eleventh chunk, which is 79 kB. The last chunk may then be padded with zeroes for simplicity, or the space may be used for the next data stream. Assuming in this example that there are 10 SSDs for storing the chunks (SSD0-SSD9), the last chunk (CDATA10) is written with its state information to disk 0 after the location where CDATA0 was written. The resulting eleven chunks (including the state information) of 100 kB each are saved with 10+4 redundancy (e.g., EC protection). Therefore, in this example the compressed data must be encoded across 14 disks, with a redundancy level that allows for 4 disks to fail without data loss. In this example, each chunk (CDATA0-CDATA9) is stored on a different storage device along with its associated state information. The parity information (EC Parity Data 0-EC Parity Data 3) is stored on storage devices other than where the compressed data chunks are stored. In this example, the EC code-word size is 100 kB. In this example, each chunk prepended with the state information is considered a code (even though it is not combined with parity information in this example) and the four “EC Parity Data” are the remaining 4 codes for a total of 14 codes. For example, if the first 10 codes (CDATA0-CDATA9 and associated state information) are stored on SSD0-SSD9, the remaining 4 codes (parity information) may be stored across SSD10-SSD13. The last chunk (CDATA10) can then be stored to SSD0 (or another SSD).
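
The chunk-size arithmetic of this example can be checked as follows; the figures come directly from the example above, with the only simplification being that the division is taken to whole chunks.

    # Arithmetic check for the example above: 2 MB compressed 2:1 to 1 MB,
    # 100 kB EC code words, 5 kB of dictionary state per chunk.
    code_word_kb = 100
    state_kb = 5
    data_per_chunk_kb = code_word_kb - state_kb                # 95 kB of compressed data per chunk
    compressed_kb = 1024                                       # 1 MB of compressed data
    full_chunks = compressed_kb // data_per_chunk_kb           # 10 full chunks
    tail_kb = compressed_kb - full_chunks * data_per_chunk_kb  # 74 kB eleventh chunk
    print(full_chunks, tail_kb, tail_kb + state_kb)            # prints: 10 74 79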

Although FIGS. 1A and 1B show separate blocks for the compression logic 106 and erasure coding logic 156, the compression logic and erasure coding logic can be integrated. In one example, the compression engine and the EC engine are integrated and termed a CEC (compression plus EC engine). The CEC first compresses the data stream using a chosen compression algorithm. The CEC then splits the compressed data stream into codes that provide EC protection and that also include the relevant portion of the compression dictionary for that code. The CEC also ensures that compressed tokens are not split across codes. This scheme can ensure that the best global compression scheme/dictionary is used for the entire data range, and each code can be individually decompressed on its storage node for local offload operations. The downside is that portions of the dictionary are replicated across the codes, which slightly decreases the compression ratio.

FIG. 3 illustrates an example of offloading operations on the compressed data. The example in FIG. 3 shows a processor or node 370 offloading an operation to a storage device or node 371. The processor 370 can be in a separate node or the same node as the storage device 371. For example, in a disaggregated server architecture, a node with the processor 370 may have primarily compute resources (e.g., a “compute node”) and the node 371 may have primarily storage resources (e.g., a “storage node”). In another example, the processor 370 is on a node with the same or similar resources as the node 371. In another example, the processor 370 and the storage device 371 are part of the same physical computer.

Regardless of whether the processor 370 and the storage device 371 are part of the same or different nodes, the processor 370 is running a process that needs to access compressed data that is stored across multiple storage devices. In conventional systems, the processor 370 would need to retrieve the compressed data from each storage device where it is stored prior to decompressing it. After receiving every chunk of a compressed data stream, the processor could decompress the data and perform the operation. Therefore, in order to perform the operation on the compressed and distributed data, a significant amount of data is transferred between the processor and the storage devices.

In contrast, the technique described herein enables offloading the operation to where the compressed data chunks are stored, which can significantly reduce data transfer. The processor includes or is executing offload logic 372 (which can be hardware, software, firmware, or a combination). In one example, the offload logic 372 determines where the chunks of compressed data are stored based on a map or algorithm. In one example, the physical location of the chunks of compressed data is defined by one or more of a node number (e.g., sled number), disk number, and a sector range on that disk (e.g., the logical block address (LBA) where the chunk is stored). The offload logic 372 then sends the request 373 to the storage device or node 371. The request 373 can include, for example, information identifying the operation to be performed on the compressed data, an address range for the compressed data, and/or other operands necessary for performing the operation. The request 373 can be transmitted as one or more commands, by writing to one or more registers on the storage device or node 371, via assertion of one or more signals to the storage device or node 371, or by any other means of communication between the processor or node 370 and the storage device or node 371.

In response to receiving the request to perform the operation, the storage device or node 371 reads the chunk, including the associated state information (e.g., CDATA0 and STATE0). CDATA0 and STATE0 are provided to the decompression logic 314. In one example, the decompression logic 314 can include a codec (coder-decoder) that decompresses coded compressed data based on the compression algorithm used for compression and the associated state information. Unlike conventional compression and decompression techniques, in the illustrated example the decompression logic 314 is able to decompress a single chunk CDATA0 independently from the other chunks in the compressed data stream. Each compressed token of the compressed data spans only a single chunk, and all information needed to decompress a given chunk is stored with the given chunk. Thus, unlike with conventional compression techniques in which all the chunks of the compressed data stream need to be received prior to decompressing the data stream, each chunk of the compressed data stream can be independently decompressed. Independent decompression of the chunks enables the decompression to be performed where the data is located. For example, where the compressed data is distributed across multiple nodes, each node can independently decompress the chunks at that node. In another example, the storage devices may include circuitry (e.g., compute-in-memory circuitry) to perform decompression and/or perform operations.
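
A loose analogue of per-chunk independent decompression, using the standard zlib preset-dictionary feature, is sketched below: each chunk is compressed independently against the same dictionary and the dictionary is stored with the chunk, so any single chunk can be decompressed in isolation. This approximates, but is not identical to, the disclosed technique, which uses a single global compression pass whose codec state is captured per chunk.

    # Loose analogue using zlib preset dictionaries; illustrative only,
    # not the disclosed codec.
    import zlib

    dictionary = b"the quick brown fox jumps over the lazy dog "  # shared state
    chunks = [b"the quick brown fox", b"jumps over the lazy dog", b"the lazy fox"]

    records = []
    for chunk in chunks:
        compressor = zlib.compressobj(zdict=dictionary)
        compressed = compressor.compress(chunk) + compressor.flush()
        records.append((dictionary, compressed))     # state stored with the chunk

    # Any record can be decompressed on its own, without the other chunks.
    state, compressed = records[1]
    decompressor = zlib.decompressobj(zdict=state)
    assert decompressor.decompress(compressed) == chunks[1]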

After the decompression logic 314 generates the decompressed data from the compressed data (CDATA0) and state information (STATE0), the decompressed data is sent to processing circuitry 315 along with the operation to be performed. The processing circuitry can include a central processing unit (CPU), analog processing circuitry, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), accelerators, or other processing and/or control circuitry. The processing circuitry can be embedded in the storage device or separate from the storage device. The processing circuitry 315 performs the operation on the decompressed data DATA0 to obtain a result. The result 375 can then be provided to the processor or node 370. Providing the result can include, for example, transmitting the result or storing the result at a specified or predetermined location (e.g., a memory location, a register, etc.). Although only a single storage device/node 371 is shown in FIG. 3, the illustrated technique can be carried out on multiple storage devices or nodes where the compressed data chunks are stored. The operation can thus be offloaded to the storage devices/nodes at the same time for concurrent independent decompression of each chunk without transferring the chunks to a central place for decompression. In another example, the operation may only need to be performed on one or some of the chunks of compressed data, in which case the operation can be offloaded to only the devices or nodes storing the chunks of interest. Thus, the solution presented here enables parallel operation of the offload, and also allows operation of the offload on a selected sub-range of a file/data-stream. These advantages come at the expense of a slight reduction in compression ratio.

Thus, independent decompression of chunks of the compressed data stream where the chunks are located can significantly reduce the amount of data transferred in order to perform an operation on compressed data. Instead of transferring each chunk of compressed data to a central location for decompression, in this example, the only data transferred are the offload requests and the results of the operation.

Note that the example in FIG. 3 assumes that the CDATA0 combined with STATE0 can be read without errors. If an error is encountered, the storage device or node can transfer STATE0 and CDATA0 to the requesting device or node for error correction with parity data stored on another storage device. In one such example, all the compressed data chunks and associated information are transferred to the requesting device or node to perform error correction. At this point, the operation can be performed at the requesting device or node.

FIG. 4 is a flow chart illustrating an example of a method 400 of compressing and storing data. The method 400 can be performed by processing circuitry, which may include, for example, a central processing unit (CPU), analog processing circuitry, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), accelerators, or other processing and/or control circuitry. In one example, the processing circuitry executes instructions of a program to perform the method 400.

The method begins with some uncompressed data that is to be compressed and stored. The uncompressed data may be, for example, text (e.g., a document with text, email, code, database entries, etc.), a media file (e.g., an image, a video file, a sound file, etc.), or any other type of data. The uncompressed data is received by the compression logic (e.g., by a software or hardware codec), at operation 404. The compression logic compresses the uncompressed data, at operation 408. The output of the compression algorithm is compressed data, which is divided into multiple portions or chunks, and dictionary state information for each chunk. Thus, in one example, each chunk has dictionary state information for that single chunk. The dictionary state information for each chunk includes sufficient information for independent decompression of each chunk. Thus, there is some overlap in the state information amongst chunks of the compressed data stream.
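
The chunk-boundary rule can be illustrated with the following Python sketch, which assumes the compressor emits whole tokens as byte strings; the token framing and the per-chunk state capture are codec specific and not shown.

    # Sketch of packing compressed tokens into chunks so that no token spans a
    # chunk boundary.
    def pack_tokens(tokens, max_chunk_bytes):
        """tokens: iterable of whole compressed tokens (bytes). Returns chunks."""
        chunks, current = [], bytearray()
        for token in tokens:
            if len(token) > max_chunk_bytes:
                raise ValueError("token larger than the chunk budget")
            if len(current) + len(token) > max_chunk_bytes:
                chunks.append(bytes(current))        # close the chunk at a token boundary
                current = bytearray()
            current.extend(token)
        if current:
            chunks.append(bytes(current))
        return chunks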

The method then involves storing the chunks and dictionary state information on different storage devices, at operation 410. Each chunk and its associated state information can be stored on a different storage device in the same node, or across multiple nodes. After compressing and storing the data according to this technique, an operation on the compressed data can then be offloaded to the nodes or devices storing the chunks. For example, a processor can send a request to offload an operation to each node or each device storing a chunk of a given compressed data stream.

FIG. 5 is a flow chart of an example of a method 500 of handling an offload request on compressed data. Some or all of the operations of method 500 can be performed by, for example, processing circuitry on a node where a chunk of compressed data is stored. In another example, some or all of the operations of method 500 can be performed by processing circuitry of a storage device that includes embedded processing circuitry.

The method 500 starts with receiving a request to offload an operation on a chunk of compressed data, at operation 502. Examples of operations include search operations, replacement, data transformation such as encryption, and other stream-based offloads. The compressed data is then read from the storage device and decompressed with decompression logic, at operation 504. Decompressing the chunk is achieved with the state information stored with the chunk, and without any of the other chunks of the compressed data stream. After decompression, the operation can be performed on the decompressed data, at operation 506. After the operation is performed on the data, a result is provided to the requesting device or node, at operation 510. In the event that the data was modified by the operation, the data can be compressed again after the operation and stored on the storage device.
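
One way to sketch this handling path in Python, including the optional write-back when the offload modifies the data, is shown below; the request fields, device, and codec interfaces are hypothetical placeholders.

    # Sketch of method 500: read, decompress with the co-located state, run the
    # offload, optionally recompress modified data, and return the result.
    def handle_offload(request, device, codec):
        record = device.read(request.address)               # operation 504: read the record
        state, compressed = record[:request.state_size], record[request.state_size:]
        codec.load_state(state)
        data = codec.decompress(compressed)
        result, modified = request.operation(data)          # operation 506: run the offload
        if modified is not None:                            # data changed by the operation
            device.write(request.address, state + codec.compress(modified))
        return result                                       # operation 510: provide result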

FIG. 6 illustrates an example of a disaggregated rack architecture in which compression and offload techniques may be implemented. FIG. 6 illustrates an example of a system with K racks 602-1-602-K of computing resources, which may be used in a data center to store and process data. The racks 602-1-602-K can be in a same physical area, or in physically or geographically separate areas. In the illustrated example, each rack includes N compute nodes and M storage nodes, where N and M can vary in different racks. For example, for rack 602-1, N may be 10, but for rack 602-2, N may be 15. A node is a physical or virtual machine including or having access to one or more computing resources. Independent of whether a node is a physical or virtual machine, a node is a unique fault domain with respect to other nodes. A fault domain is an independent domain with no single point of failure (e.g., there are redundant cooling, power, and/or network paths). A storage node is a physical computer (e.g., server) including non-volatile storage. In the example illustrated in FIG. 6, the storage nodes 606 include solid state drives (SSDs) 616 to store data. The storage nodes 606 also include processing circuitry 620, which may include one or more of: a central processing unit (CPU), analog processing circuitry, a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), accelerators, or other processing and/or control circuitry. A compute node is a physical computer including processors. For example, the compute nodes 604 include CPUs 608. The compute node 604 also includes storage 610, which can be a solid-state drive or other non-volatile storage. A compute node can be referred to as a compute sled, blade, shelf, chassis, server, appliance, machine, or computer. Similarly, a storage node can also be referred to as a storage sled, blade, shelf, chassis, server, appliance, machine, or computer.

The compute node illustrated in FIG. 6 includes CPUs 608, storage 610, input/output (I/O) interface logic 615, and logic 614. The I/O interface logic 615 can include hardware and/or software to enable communication both within the compute node and with other nodes. The logic 614 can include hardware, software, or both to implement the compression, decompression, and offloading techniques described in this disclosure. The storage node 606 includes SSDs 616, processing circuitry 620, I/O interface logic 625, and logic 624. The logic 624 can include hardware, software, or both to implement the compression, decompression, and offloading techniques described in this disclosure. The nodes 604 and 606 can include different or additional resources than what is depicted in FIG. 6.

The nodes are communicatively coupled by one or more networks. For example, the nodes within the rack can be coupled via an Ethernet or proprietary local area network (LAN). The racks 602-1-602-K can include a switching hub (not shown in FIG. 6) to implement such a network. Multiple racks can be communicatively coupled to one another via gateways between each rack's network and another, external network that couples the racks to one another.

The nodes in FIG. 6 are disaggregated in the sense that data center hardware resources (e.g., compute, memory, storage, and network resources) can be packaged and installed individually in a rack. For example, storage resources are installed in the racks 602 as storage nodes or sleds, and compute resources are installed in the racks 602 as compute nodes or sleds. Thus, the compute nodes and storage nodes in FIG. 6 differ from conventional servers in that different nodes can include a different balance of computing resources and do not necessarily include all the components of a conventional server. In a conventional rack infrastructure, the computing resources have the granularity of an entire server computer. Thus, in a traditional infrastructure, a deficiency in resources can only be addressed by adding an entire server computer. As an example, to address a deficiency in CPU processing power, one or more additional servers would be added to the rack, which would increase the CPU processing power. However, the additional servers would also increase the storage resources and other power-consuming elements, which may be unnecessary and even undesirable. Unlike a conventional rack architecture, a disaggregated architecture enables addressing deficiencies in resources by adding more of the specific resources that are lacking without adding additional and unnecessary resources.

Data stored in a data center is typically stored across multiple devices, nodes, and/or racks to improve load balancing. As discussed above, data may also be compressed to reduce the resources needed to store and transmit the data. Compression of data may be lossless or lossy. An example of lossless compression involves identifying redundancies in data and encoding the data to eliminate or reduce the redundancy. Additional redundancies can be added to the compressed data to improve availability. For example, the chunks can be erasure-coded to generate codes that are stored across multiple nodes.

FIG. 7 provides an exemplary depiction of a computing system 700 in which compression and offloading techniques can be implemented. The computing system 700 can be, for example, user equipment, a computer, a personal computer (PC), a desktop computer, a laptop computer, a notebook computer, a netbook computer, a tablet, a smart phone, embedded electronics, a gaming console, a server array or server farm, a web server, a network server, an Internet server, a work station, a mini-computer, a mainframe computer, a supercomputer, a network appliance, a web appliance, a distributed computing system, a multiprocessor system, a processor-based system, or a combination thereof. As observed in FIG. 7, the system 700 includes one or more processors or processing units 701 (e.g., host processor(s)). The processor(s) 701 may include one or more central processing units (CPUs), each of which may include, e.g., a plurality of general-purpose processing cores. The processor(s) 701 may also or alternatively include one or more graphics processing units (GPUs) or other processing units. The processor(s) 701 may include memory management logic (e.g., a memory controller) and I/O control logic. The processor(s) 701 typically include cache on the same package or near the processor.

The system 700 also includes memory 702 (e.g., system memory). The system memory can be in the same package (e.g., same SoC) or separate from the processor(s) 701. The system 700 can include static random-access memory (SRAM), dynamic random-access memory (DRAM), or both. In some examples, memory 702 may include volatile types of memory including, but not limited to, RAM, D-RAM, DDR SDRAM, SRAM, T-RAM or Z-RAM. One example of volatile memory includes DRAM, or some variant such as SDRAM. Memory as described herein may be compatible with a number of memory technologies, such as DDR4 (DDR version 4, initial specification published in September 2012 by JEDEC), LPDDR4 (LOW POWER DOUBLE DATA RATE (LPDDR) version 4, JESD209-4, originally published by JEDEC in August 2014), WIO2 (Wide I/O 2 (WideIO2), JESD229-2, originally published by JEDEC in August 2014), HBM (HIGH BANDWIDTH MEMORY DRAM, JESD235, originally published by JEDEC in October 2013), DDR5 (DDR version 5, currently in discussion by JEDEC), LPDDR5 (LPDDR version 5, currently in discussion by JEDEC), HBM2 (HBM version 2, currently in discussion by JEDEC), and/or others, and technologies based on derivatives or extensions of such specifications. In one example, the memory 702 includes a byte addressable DRAM or a byte addressable non-volatile memory such as a byte-addressable write-in-place three dimensional crosspoint memory device, or other byte addressable write-in-place non-volatile memory devices (also referred to as persistent memory), such as single or multi-level Phase Change Memory (PCM) or phase change memory with a switch (PCMS), NVM devices that use chalcogenide phase change material (for example, chalcogenide glass), resistive memory including metal oxide base, oxygen vacancy base and Conductive Bridge Random Access Memory (CB-RAM), nanowire memory, ferroelectric random access memory (FeRAM, FRAM), magnetoresistive random access memory (MRAM) that incorporates memristor technology, spin transfer torque (STT)-MRAM, a spintronic magnetic junction memory based device, a magnetic tunneling junction (MTJ) based device, a DW (Domain Wall) and SOT (Spin Orbit Transfer) based device, a thyristor based memory device, or a combination of any of the above, or other memory.

The system 700 also includes communications interfaces 706 and other components 708. The other components may include, for example, a display (e.g., touchscreen, flat-panel), a power supply (e.g., a battery and/or other power supply), sensors, power management logic, or other components. The communications interfaces 706 may include logic and/or features to support a communication interface. For these examples, communications interface 706 may include one or more input/output (I/O) interfaces that operate according to various communication protocols or standards to communicate over direct or network communication links or channels. Direct communications may occur via use of communication protocols or standards described in one or more industry standards (including progenies and variants). For example, I/O interfaces can be arranged as a Serial Advanced Technology Attachment (SATA) interface to couple elements of a node to a storage device. In another example, I/O interfaces can be arranged as a Serial Attached Small Computer System Interface (SCSI) (or simply SAS), Peripheral Component Interconnect Express (PCIe), or Non-Volatile Memory Express (NVMe) interface to couple a storage device with other elements of a node (e.g., a controller, or other element of a node). Such communication protocols may be utilized to communicate through I/O interfaces as described in industry standards or specifications (including progenies or variants) such as the Peripheral Component Interconnect (PCI) Express Base Specification, revision 3.1, published in November 2014 (“PCI Express specification” or “PCIe specification”) or later revisions, and/or the Non-Volatile Memory Express (NVMe) Specification, revision 1.2, also published in November 2014 (“NVMe specification”) or later revisions. Network communications may occur via use of communication protocols or standards such as those described in one or more Ethernet standards promulgated by IEEE. For example, one such Ethernet standard may include IEEE 802.3. Network communication may also occur according to one or more OpenFlow specifications such as the OpenFlow Switch Specification. Other examples of communications interfaces include, for example, a local wired point-to-point link (e.g., USB) interface, a wireless local area network (e.g., WiFi) interface, a wireless point-to-point link (e.g., Bluetooth) interface, a Global Positioning System interface, and/or other interfaces.

The computing system 700 also includes non-volatile storage 704, which may be the mass storage component of the system. Non-volatile types of memory may include byte or block addressable non-volatile memory such as, but not limited to, NAND flash memory (e.g., multi-threshold level NAND), NOR flash memory, single or multi-level phase change memory (PCM), resistive memory, nanowire memory, ferroelectric transistor random access memory (FeTRAM), magnetoresistive random access memory (MRAM) that incorporates memristor technology, spin transfer torque MRAM (STT-MRAM), 3-dimensional (3D) cross-point memory structure that includes chalcogenide phase change material (e.g., chalcogenide glass), hereinafter referred to as “3D cross-point memory”, or a combination of any of the above. For these examples, storage 704 may be arranged or configured as a solid-state drive (SSD). The data may be read and written in blocks, and mapping or location information for the blocks may be kept in memory 702. The storage or memory of the system 700 can include processing circuitry, enabling some operations described above to be performed in compute-in-memory. In one example, the non-volatile storage 704 stores the chunks and associated state information discussed above.

The computing system 700 may also include one or more accelerators or other computing devices 710. For example, the computing system 700 may include an Artificial Intelligence (AI) or machine learning accelerator optimized for performing operations for machine learning algorithms, a graphics accelerator (e.g., GPU), or other type of accelerator. An accelerator can include processing circuitry (analog, digital, or both) and may also include memory within the same package as the accelerator 710.

Examples of techniques for compression, decompression, and compute offloading follow.

In one example, a storage node includes input/output (I/O) interface logic to receive a request to perform an operation to access compressed data, a chunk of the compressed data and its associated state information to be stored on the storage node, and logic (e.g., hardware, software, firmware, or a combination) to decompress the chunk at the storage node with its associated state information independently from other chunks of the compressed data, perform the operation on the decompressed data, and provide a result from the operation. In one example, each compressed token of the compressed data is to span a single chunk. In one example, the state information includes dictionary state information for a single chunk. In one example, a portion of the state information for one chunk is replicated in the state information for another chunk. In one example, in response to an error in the chunk, the I/O interface logic transfers the chunk to a requesting device for error correction with parity data stored on another storage node. In one example, the storage node is a storage sled and the request is from a compute sled. In another example, the storage node is or includes a storage device, and the request is from a computing device on a same node as the storage device.

In one example, a compute node includes processing circuitry to compress uncompressed data to generate compressed data, divide the compressed data into chunks, and generate state information for each chunk of the compressed data, each chunk independently de-compressible with its associated state information. The compute node includes input/output (I/O) interface logic to store the compressed data on a plurality of storage devices, each chunk of the compressed data to be stored on a same storage device as its associated state information. In one example, the I/O interface logic is to send, to one or more of the plurality of storage devices, a request to perform an operation on the compressed data, each chunk to be independently decompressed and the operation to be independently performed on each decompressed chunk, and receive results of the operation from the one or more storage devices. In one example, the processing circuitry is to combine a chunk of compressed data with its associated state information. In one such example, combining a chunk of compressed data with its associated state information involves prepending or appending the associated state information to the chunk of compressed data. In one example, the processing circuitry is to pad one of the chunks of compressed data to generate chunks with equal length. In one example, the processing circuitry is to further perform erasure coding on the compressed data together with the associated state information to generate parity data, and store the parity data to non-volatile storage devices other than the plurality of devices storing the chunks of compressed data. In one such example, each of the plurality of storage devices resides on a different storage node. In one example, the plurality of storage devices reside on a same node.

In one example, an article of manufacture comprises a computer readable storage medium having content stored thereon which, when accessed, causes one or more processors to execute operations to perform a method described herein. In one example, a method involves receiving uncompressed data, compressing the uncompressed data to generate compressed data, dividing the compressed data into chunks, generating state information for each chunk of the compressed data, each chunk independently de-compressible with its associated state information, and storing the compressed data on a plurality of storage devices, each chunk of the compressed data to be stored on a same storage device as its associated state information.

In one example, a method involves receiving, at a storage device, a request to perform an operation on a chunk of compressed data, decompressing the chunk of compressed data with its associated state information independently from other chunks of the compressed data, performing the operation on the decompressed data, and providing a result from the operation. In one example, a system includes a plurality of storage devices to store chunks of compressed data, each chunk of the compressed data to be stored on a different storage device, each of the plurality of storage devices including: one or more storage arrays to store a chunk of the compressed data and state information for the chunk, an input/output (I/O) interface to receive a request to perform an operation on compressed data, and processing circuitry to: decompress the chunk of the compressed data independent of other chunks of the compressed data with state information for the chunk, perform the operation on the chunk, and provide a result from the operation. In one example, the system includes a processor coupled with the plurality of storage devices, the processor to send the request to each of the plurality of storage devices to perform the operation on the compressed data.

Embodiments of the invention may include various processes as set forth above. The processes may be embodied in machine-executable instructions. The instructions can be used to cause a general-purpose or special-purpose processor to perform certain processes. Alternatively, these processes may be performed by specific/custom hardware components that contain hardwired logic circuitry or programmable logic circuitry (e.g., FPGA, PLD) for performing the processes, or by any combination of programmed computer components and custom hardware components.

Elements of the present invention may also be provided as a machine-readable medium for storing the machine-executable instructions. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs, magneto-optical disks, FLASH memory, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, propagation media, or other types of media/machine-readable medium suitable for storing electronic instructions. For example, the present invention may be downloaded as a computer program which may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).

Flow diagrams as illustrated herein provide examples of sequences of various process actions. The flow diagrams can indicate operations to be executed by a software or firmware routine, as well as physical operations. In one example, a flow diagram can illustrate the state of a finite state machine (FSM), which can be implemented in hardware, software, or a combination. Although shown in a particular sequence or order, unless otherwise specified, the order of the actions can be modified. Thus, the illustrated embodiments should be understood only as an example, and the process can be performed in a different order, and some actions can be performed in parallel. Additionally, one or more actions can be omitted in various examples; thus, not all actions are required in every embodiment. Other process flows are possible.

To the extent various operations or functions are described herein, they can be described or defined as software code, instructions, configuration, data, or a combination. The content can be directly executable (“object” or “executable” form), source code, or difference code (“delta” or “patch” code). The software content of the embodiments described herein can be provided via an article of manufacture with the content stored thereon, or via a method of operating a communication interface to send data via the communication interface. A machine-readable storage medium can cause a machine to perform the functions or operations described and includes any mechanism that stores information in a form accessible by a machine (e.g., computing device, electronic system, etc.), such as recordable/non-recordable media (e.g., read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.). A communication interface includes any mechanism that interfaces to any of a hardwired, wireless, optical, etc., medium to communicate to another device, such as a memory bus interface, a processor bus interface, an Internet connection, a disk controller, etc. The communication interface can be configured by providing configuration parameters or sending signals, or both, to prepare the communication interface to provide a data signal describing the software content. The communication interface can be accessed via one or more commands or signals sent to the communication interface.

Various components described herein can be a means for performing the operations or functions described. Each component described herein includes software, hardware, or a combination of these. The components can be implemented as software modules, hardware modules, special-purpose hardware (e.g., application specific hardware, application specific integrated circuits (ASICs), digital signal processors (DSPs), etc.), embedded controllers, hardwired circuitry, etc.

Besides what is described herein, various modifications can be made to the disclosed embodiments and implementations of the invention without departing from their scope. Terms used above to describe the orientation and position of features such as ‘top’, ‘bottom’, ‘over’, ‘under’, and other such terms describing position are intended to clarify the relative location of features relative to other features, and do not describe a fixed or absolute position. For example, a wafer that is described as the top wafer that is above or over a bottom wafer could be described as a bottom wafer that is under or below a top wafer. Therefore, the illustrations and examples herein should be construed in an illustrative, and not a restrictive, sense. The scope of the invention should be measured solely by reference to the claims that follow.

What is claimed is:
 1. An article of manufacture comprising a computer readable storage medium having content stored thereon which when accessed causes processing circuitry to execute operations to perform a method comprising: receiving, at a storage device, a request to perform an operation on a chunk of compressed data; decompressing the chunk of compressed data with its associated state information independently from other chunks of the compressed data; performing the operation on the decompressed data; and providing a result from the operation.
 2. The article of manufacture of claim 1, wherein: each compressed token of the compressed data is to span a single chunk.
 3. The article of manufacture of claim 1, wherein: the state information includes dictionary state information for a single chunk.
 4. The article of manufacture of claim 1, wherein: a portion of the state information for one chunk is replicated in the state information for another chunk.
 5. The article of manufacture of claim 1, the method further comprising: in response to an error in the chunk, transferring the chunk to the requesting device for error correction with parity data stored on another storage device.
 6. The article of manufacture of claim 1, wherein: the storage device resides on a storage node and the request is from a compute node.
 7. The article of manufacture of claim 1, wherein: the storage device resides on a same node as the requesting device.
 8. An article of manufacture comprising a computer readable storage medium having content stored thereon which when accessed causes processing circuitry to execute operations to perform a method comprising: receiving uncompressed data; compressing the uncompressed data to generate compressed data; dividing the compressed data into chunks; generating state information for each chunk of the compressed data, each chunk independently de-compressible with its associated state information; and storing the compressed data on a plurality of storage devices, each chunk of the compressed data to be stored on a same storage device as its associated state information.
 9. The article of manufacture of claim 8, the method further comprising: sending, to one or more of the plurality of storage devices, a request to perform an operation on the compressed data, each chunk to be independently decompressed and the operation to be independently performed on each decompressed chunk; and receiving results of the operation from the one or more storage devices.
 10. The article of manufacture of claim 8, wherein: each compressed token of the compressed data is to span a single chunk.
 11. The article of manufacture of claim 8, wherein: the state information includes dictionary state information for a single chunk.
 12. The article of manufacture of claim 8, the method further comprising: combining a chunk of compressed data with its associated state information.
 13. The article of manufacture of claim 12, wherein combining a chunk of compressed data with its associated state information comprises: prepending the associated state information to the chunk of compressed data.
 14. The article of manufacture of claim 8, further comprising: padding one of the chunks of compressed data to generate chunks with equal length.
 15. The article of manufacture of claim 8, wherein: a portion of the state information for one chunk is replicated in the state information for another chunk.
 16. The article of manufacture of claim 8, the method further comprising: performing erasure coding on the compressed data together with the associated state information to generate parity data; and storing the parity data to non-volatile storage devices other than the plurality of devices storing the chunks of compressed data.
 17. The article of manufacture of claim 8, wherein: each of the plurality of storage devices resides on a different storage node.
 18. The article of manufacture of claim 8, wherein: the plurality of storage devices reside on a same node.
 19. A storage node comprising: input/output (I/O) interface logic to: receive a request to perform an operation to access compressed data, a chunk of the compressed data and its associated state information to be stored on the storage node; and logic to: decompress the chunk at the storage node with its associated state information independently from other chunks of the compressed data, perform the operation on the decompressed data, and provide a result from the operation.
 20. The storage node of claim 19, wherein: each compressed token of the compressed data is to span a single chunk.